
ENT-14108: cf-execd.service: drain cf-agent on stop#6146

Open
larsewi wants to merge 1 commit into cfengine:master from larsewi:drain-cf-agent-systemd

Conversation


@larsewi larsewi commented May 27, 2026

`KillMode=process` only signals cf-execd. Any cf-agent spawned by cf-execd keeps running after `systemctl stop` returns. A mid-run agent can then re-trigger cf-php-fpm (`Wants=cf-postgres`), causing dependencies to be pulled back in after the stop was reported successful.

This fix adds an `ExecStopPost=` step that waits up to 60s for cf-agent to drain, then sends `SIGKILL` to any survivor. It runs after cf-execd has exited, so no new agents are spawned during the drain.
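The drain-then-kill behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the contents of the PR: the function name `drain_by_name` and the one-second poll interval are assumptions, and it relies only on `pgrep` and `pkill` being available.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): wait up to $2 seconds for all
# processes whose name is exactly $1 to exit, polling once per second.
# If any are still running when the timeout is reached, send SIGKILL.
# Returns 0 if the processes drained on their own, 1 if a kill was needed.
drain_by_name() {
    drain_name=$1
    drain_timeout=$2
    drain_elapsed=0
    # The loop ends as soon as no matching process is left, so the
    # timeout is an upper bound rather than a fixed sleep.
    while pgrep -x "$drain_name" >/dev/null 2>&1; do
        if [ "$drain_elapsed" -ge "$drain_timeout" ]; then
            pkill -KILL -x "$drain_name" 2>/dev/null
            return 1
        fi
        sleep 1
        drain_elapsed=$((drain_elapsed + 1))
    done
    return 0
}
```

Under these assumptions, `drain_by_name cf-agent 60` would return immediately when no agent is running and would only fall back to `SIGKILL` after the full 60 seconds.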

Ticket: ENT-14108


larsewi commented May 27, 2026

@cf-bottom Jenkins please :)

@larsewi larsewi added the cherry-pick? Fixes which may need to be cherry-picked to LTS branches label May 27, 2026

@nickanderson nickanderson left a comment


Not sure about the 60s wait. Is it not possible for cf-agent to start cf-execd inside those 60s?


larsewi commented May 28, 2026

> Not sure about the 60s wait. Is it not possible for cf-agent to start cf-execd inside those 60s?

Not sure I follow @nickanderson. Is it not cf-execd that starts cf-agent, and not the other way around?

With this fix: when you stop cf-execd, it now waits for cf-agent to finish. If it does not finish within 60 seconds, it gets killed.

This fixes the issue where a lingering agent can start pulling dependencies back in after `systemctl stop cfengine3`, causing upgrades to fail. For example, the cfengine3 umbrella stops postgres and the agent then starts it again.
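To experiment with this kind of post-stop step without editing the packaged unit, one option is a systemd drop-in. The sketch below is an assumption-heavy illustration rather than what the PR ships: the helper name `write_drain_dropin`, the file name `drain-cf-agent.conf`, and the command passed in are all hypothetical. Only the `[Service]` section and the `ExecStopPost=` directive are taken from systemd itself.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): write a drop-in that adds an
# ExecStopPost= command to a unit. $1 is the drop-in directory, $2 is the
# command systemd should run after the main service process has exited.
write_drain_dropin() {
    dropin_dir=$1
    drain_cmd=$2
    mkdir -p "$dropin_dir" || return 1
    # A drop-in only needs the section header and the directives it adds.
    cat > "$dropin_dir/drain-cf-agent.conf" <<EOF
[Service]
ExecStopPost=$drain_cmd
EOF
}

# Example invocation (left commented out; the script path is made up):
# write_drain_dropin /etc/systemd/system/cf-execd.service.d \
#     "/usr/local/bin/drain-cf-agent 60"
# systemctl daemon-reload
```

Because `ExecStopPost=` commands run only after the service's main process has stopped, the drain step cannot race with cf-execd spawning a new agent.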

nickanderson commented

> Not sure I follow @nickanderson. Is it not cf-execd that starts cf-agent, and not the other way around?

Both things can be true. There is policy in the MPF that watches over CFEngine's own processes, although I think it is mostly skipped on systemd hosts. For example:

  processes:

    !windows::

      "bin/cf-execd" -> { "CFE-2974" }
        restart_class => "cf_execd_not_running",
        comment => "If cf-execd isn't running, define a class so that it will be started",
        handle => "cfe_internal_limit_robot_agents_processes_cf_execd_not_running";

      "bin/cf-monitord" -> { "CFE-2963" }
        restart_class => "cf_monitord_not_running",
        handle => "cfe_internal_limit_robot_agents_classify_cf_monitord_not_running",
        comment => "We want cf-monitord to be running, but in order to avoid
                    non-convergent promises, this must be separated from the
                    promise to terminate misbehaving daemons";

  commands:

    cf_execd_not_running::

      "$(sys.cf_execd)"
        comment => "Restart cf-execd process",
        handle => "cfe_internal_limit_robot_agents_commands_restart_cf_execd";

    cf_monitord_not_running::

      "$(sys.cf_monitord)"
        comment => "Restart cf-monitord process",
        handle => "cfe_internal_limit_robot_agents_commands_restart_cf_monitord";

There are also some promises that target systemd, but notice that the cf-execd one is commented out because of FUD.

  services:

    systemd::

      "cf-serverd"
        service_policy => "restart",
        if => "(server_controls_repaired|runagent_controls_repaired)";

      "cf-monitord"
        service_policy => "restart",
        if => "monitor_controls_repaired";

    systemd.enterprise_edition.(am_policy_hub|policy_server)::

      "cf-hub"
        service_policy => "restart",
        if => "hub_controls_repaired";


      # Well, this is dangerous we might kill our own agent
      # "cf-execd"
      #   service_policy => "restart",
      #   if => "(execd_controls_repaired|runagent_controls_repaired)";

I guess I am wondering what waiting for an arbitrary time really gains us. If I `systemctl stop cf-execd`, what is the real difference between waiting 2 seconds and 60 seconds? Neither is based on the actual system state or on how long we expect an agent process to take.


larsewi commented May 28, 2026

> I guess I am wondering what waiting for an arbitrary time really gains us. If I `systemctl stop cf-execd`, what is the real difference between waiting 2 seconds and 60 seconds? Neither is based on the actual system state or on how long we expect an agent process to take.

So what you're saying, @nickanderson, is: why not just kill the agent right away instead of waiting for it to finish?

nickanderson commented

> So what you're saying, @nickanderson, is: why not just kill the agent right away instead of waiting for it to finish?

Maybe I am, I dunno. I am probably just overthinking it. Why not give it at least 60s to finish up? That's why I went ahead and approved it. It just seemed arbitrary, and I was looking for meaning.

Comment thread misc/systemd/cf-execd.service.in Outdated
`KillMode=process` only signals cf-execd. Any cf-agent spawned by
cf-execd keeps running after systemctl stop returns. A mid-run agent can
then re-trigger cf-php-fpm (`Wants=cf-postgres`), causing dependencies
to be pulled back in after the stop was reported successful.

This fix adds `ExecStopPost=` that waits up to 60s for cf-agent to
drain, then `SIGKILL`s any survivor. It runs after cf-execd has exited,
so no new agents are spawned during the drain.

Ticket: ENT-14108
Changelog: cf-execd systemctl stop now waits for in-flight cf-agent to finish
Signed-off-by: Lars Erik Wik <lars.erik.wik@northern.tech>
@larsewi larsewi force-pushed the drain-cf-agent-systemd branch from 3a97435 to cd78895 Compare May 29, 2026 13:50

larsewi commented May 29, 2026

@cf-bottom Jenkins please :)


Labels

cherry-pick? Fixes which may need to be cherry-picked to LTS branches


4 participants